Phylogenetic Trees with Open Tree of Life
Chapter 13
As in previous weeks, for today’s lab you are to create a new
RMarkdown document using the class lab report template.
Within your .Rmd document, include all you code, resulting
plots, and answers to questions below.
When you are done with your report, use knitr to convert it to .PDF format to submit on Canvas. It is important that you document each step of your workflow using comments and that you break up the sections of your analysis into SEPARATE code chunks.
In this chapter you will experiment with using the Open Tree of Life
in R. This Chapter was adapted from a tutorial by Luna L.
Sanchez Reyes and Emily Jane McTavish.
From within R, install the packages rotl,
ape, devtools and stringr with
the function install.packages(), and the package
datelife with the function
install_github().
Load them into your workspace with library() or
require().
When in doubt, follow the code chunk below.
Hint: If you do not want to load the packages, you can call functions
specifying their package using two colons and the syntax
package_name::function_name().
This implies more typing, but gives more clarity to reproduce the workflow later. So we will use that syntax for this tutorial.
An exception to this, are functions from packages that are
“preloaded” –such as library() from base, or
install.packages() from utils; all these can
simply be called by their name.
First install these packages by running the following command in the console (usually in the botom left panel in RStudio).
packages <- c("rotl", "ape", "devtools", "stringr")
install.packages(packages)
devtools::install_github("phylotastic/datelife")
devtools::install_github("phylotastic/datelifeplot")
library(rotl)
library(stringr)
library(datelife)
library(datelifeplot)
library(tidyverse)
library(phytools)
Scientific reproducibility is key for the advancement of Science. In this first episode, we will check that you have the same package versions that we will use throughout the tutorial.
We will use the function packageVersion from the utils package to register the package version we are using for this tutorial. It only takes a single element character vector as input, so you will have to type the function and the package name each time, as follows:
packageVersion("rotl")
packageVersion("ape")
packageVersion("devtools")
packageVersion("stringi")
packageVersion("datelife")
packageVersion("datelifeplot")
packageVersion("phytools")
Alternatively, you can create a character vector of package names and
use an lapply to get versions of all packages at once:
packages <- c("rotl", "ape", "devtools", "stringr", "datelife", "datelifeplot")
names(packages) <- packages
lapply(packages, packageVersion)
You’ll want to keep track of versions for your own purposes and also for reporting which version you used when you share your methods and results.
It is also useful to also print the R session info with
sessionInfo:
sessionInfo()
This shows you what version of R itself you’re using, as
well as what packages are loaded.
Open Tree of Life Taxonomy
The
Open
Tree of Life Taxonomy (OTT from now on) synthesizes taxonomic
information from different sources and assigns each taxon a unique
numeric identifier, which we refer to as the OTT id. To interact with
the OTT (and any other Open Tree of Life services) using R,
we will learn how to use the functions from the rotl
package.
To deal with synonyms and scientific name misspellings, the Open Tree Taxonomy uses the Taxonomic Name Resolution Service, that allows linking scientific names to a unique OTT id, while dealing with misspellings, synonyms and scientific name variants. The functions from rotl that interact with OTT’s TNRS start with “tnrs_”.
To get OTT ids for a taxon or set of taxa
(e.g a species, genus, family, or some other taxonomic level) we will
use the function tnrs_match_names(). This function takes a
character vector of one or more scientific names as main argument.
Do a tnrs_match_names() run for the amphibians
(Amphibia). Save the output to an object named
resolved_name. Try different misspellings and synonyms of
your taxon to see TNRS in action.
resolved_name <- rotl::tnrs_match_names(names = "amphibian")
resolved_name
Once you are comfortable with running the function
tnrs_match_names, move on to explore the structure of the
output.
As we can tell from the data printed to screen, the output of the
tnrs_match_names function is some sort of a data table. In
R (and all object-oriented programmming languages), defined
data structures called classes are assigned to objects. This makes data
manipulation and usage of objects across different functions much
easier. Redundantly, a class is defined as a data structure that is the
same among all objects that belong to the same class. However, we can do
more to understand the structure of any class. To get the name of the
class of the tnrs_match_names() output, we will use the
function class.
class(resolved_name)
As you can see, an object can belong to one or more classes.
Indeed, R is telling us that the output of
tnrs_match_names() is a data frame (a type of table) and a
‘match_names’ object, which is in turn a data frame with exactly 7 named
columns: search_string, unique_name,
approximate_match, ott_id,
is_synonym, flags, and
number_matches.
Next, we will explore the kinds of data that are stored in each of the columns of a ‘match_names’ object.
You should have a good idea by now of what type of data is stored in
the ott_ids column.
Questions:
Can you guess what type of data is displayed in the column
search_string and unique_name?
How about is_synonym?
The column approximate_match tells us whether the unique
name was inferred from the search string using approximate matching
(TRUE) or not (FALSE).
Finally, the flags column tells us if our unique name
has been flagged in the OTT (TRUE) or not (FALSE). It also indicates the
type of flag associated to the taxon. Flags are markers that indicate if
the taxon in question is problematic and should be included in further
analyses of the Open Tree workflow. You can read more about flags in the
Open
Tree wiki.
Now we know what kind of data is retrieved by the
tnrs_match_names() function. Pretty cool!
Now that we know about classes and the data structure of the
tnrs_match_names output, we will learn how to use the
tnrs_match_names function for multiple taxa. In this case,
you will have to create a character vector with your taxon names and use
it as input for tnrs_match_names:
Do a tnrs_match_names() run for the amphibians
(Amphibia), the genus of the cat (Felis), the family of
dolphins (Delphinidae), and the class of birds (Aves). Save the output
to an object named resolved_names.
Again, you can try different misspellings and synonyms of your taxa to see TNRS in action.
my_taxa <- c("amphibians", "felis", "canis, "delphinidae", "aves", "primates")
resolved_names <- rotl::tnrs_match_names(names = my_taxa, context_name = "All life")
resolved_names
You should get a matched named for all the taxa in this example. If
you do not get a match for all your taxa, and you get an unexpected
warning message, it means that the tnrs_match_names
function might not be working as expected.
Next, we are going to learn how to extract specific pieces of data from a ‘match_names’ object to use in other functions and workflows.
It is easy to access elements from a ‘match_names’ object using it’s column name, like this:
resolved_names$unique_name
The ‘match_names’ object has a relatively simple structure that is
easy to explore and mine. We will see later that the outputs of other
rotl functions are more complicated and accessing their
elements requires a lot of hacking. Fortunately, the rotl creators have
added some functions that allow interacting with these complicated
outputs. The functions unique_name(),
ott_id(), and flags() extract values from the
respective columns of a ‘match_names’ object, in the form of a list
instead of a vector. To extract data from the other columns there are no
specialized functions, so you will have to index.
Challenge 1
You now have a ‘match_names’ object that we called
resolved_names. Extract the OTT ids from it and store them
in an object we will call my_ott_ids.
Hint: use your knowledge on data frames and tables to
extract the data from the ott_id column.
If you just want to look a partiular row, you can use the
inspect() function from rotl
rotl::inspect(resolved_names, taxon_name = "Delphinidae")
Challenge 2
Can you think of another way to extract the Delphinidae row?
Hint: consider using data frame wrangling from the
tidyverse package.
Now you know how to get OTT ids from a bunch of taxa of interest. Next, we’ll explore what exactly you can do with these.
Building a Synthetic Tree
The synthetic Open Tree of Life (synthetic OpenTree from now on) summarizes information from trees from peer-reviewed and published studies that have been uploaded to the OpenTree database (the Phylesystem) through a curator system.
Functions from the rotl package that interact with the
synthetic OpenTree start with tol_.
To access general information about the current synthetic OpenTree,
we can use the function tol_about(). This function requires
no argument.
rotl::tol_about()
Question:
The tree is huge! So, what if we just want a small piece of the whole synthetic OpenTree? Well, now that we have some interesting taxon OTT ids, we can easily do this.
The function tol_induced_subtree() allows us to get a
tree of taxa from different taxonomic ranks.
resolved_names$ott_id
my_tree <- rotl::tol_induced_subtree(ott_ids = resolved_names$ott_id)
Let’s look at the output of tol_induced_subtree().
my_tree
R is telling us that we have a rooted tree with no
branch lengths and 5 tips. If we check the class of the output, we will
verify that it is a ‘phylo’ object.
class(my_tree)
A ‘phylo’ object is a data structure that stores the necessary
information to build a tree. There are several functions from different
packages to plot trees or ‘phylo’ objects in R (e.g.,
phytools). For now, we will use the one from the legendary
ape package plot.phylo():
ape::plot.phylo(my_tree, cex = 2) # or just plot(my_tree, cex = 2)
This is cool!
Now, what if you want a piece of the synthetic OpenTree containing all descendants of your taxa of interest?
We can extract a subtree of all descendants of one taxon at a time
using the function tol_subtree() and an OTT id of your
choosing. Let’s extract a subtree of all amphibians.
First, get its OTT id. It is already stored in our
resolved_names object:
amphibia_ott_id <- resolved_names %>% dplyr::filter(unique_name == "Amphibia") %>% select(ott_id)
amphibia_ott_id
Now, extract the subtree from the synthetic OpenTree using
tol_subtree().
amphibia_subtree <- rotl::tol_subtree(ott_id = amphibia_ott_id$ott_id)
Let’s look at the output:
amphibia_subtree
This is a large tree! We will have a hard time plotting it.
We can extract and plot a subtree for the genus Felis It should be way smaller!
subtree <- rotl::tol_subtree(resolved_names["Felis",]$ott_id)
ape::plot.phylo(subtree)
There is not a specific function in the package rotl
that gets all taxa from a given taxonomic rank. We will now shift to the
datelife package and use the
get_ott_children() function, that extracts OTT ids of all
taxa from a rank specified by the argument ott_rank.
Let’s get all amphibian families.
amphibia_families <- datelife::get_ott_children(ott_ids = amphibia_ott_id$ott_id, ott_rank = "family")
class(amphibia_families)
Note that amphibia_families (resulting from
datelife::get_ott_children()) is a list. One good way to
explore lists is to view it in RStudio. Enter the following
View() command into the console.
View(amphibia_families)
Now we can get the induced subtree using the amphibian families’ OTT ids.
Challenge 3
Explore View(amphibia_families) and you’ll see that the
OTT IDs are nested in the list structure. There are at least two
different ways to extract them from the list, see if you can figure one
out to generate the Amphibia subtree:
amphibia_families_subtree <- rotl::tol_induced_subtree(<insert ott ids here>)
Questions:
What is the structure of amphibia_families?
Explain.
How many tips does the amphibian families phylo object (tree) have?
Plot the amphibian families tree! Look at the arguments of
function tol_induced_subtree() with
??tol_induced_subtree() to plot without OTT IDs in the tip
labels.
Say you are interested in understanding and visualizing the time frame in which the tips and ancestors on your tree diverged and diversified. Standard phylogenetic trees are generally phylograms, in which branch lengths correspond to divergence (i.e. how much change has occurred since the branches split from a common ancestor). To know how much time has passed since branches or species split from a common ancestor, a model (such as a molecular clock model) must be applied to convert the phylogram into a chronogram (also known as a timetree.
In rotl, you can search the chronogram database to try
and build a synthetic chronogram of your taxon of interest.
Let’s build a synthetic chronogram of Canis.
First, we need to find the OTT ID of the most recent common ancestor (MRCA) of Canis.
canis_node_info <- rotl::tol_node_info(canis_ott_id$ott_id)
canis_mrca <- canis_node_info[["node_id"]] # this will get us the actual ID itself rather than data frame or a list
In the package datelife, we have implemented a workflow
that extracts all studies containing information from at least two taxa.
You can get all source chronograms from an induced subtree, as long as
the tip labels are in the “name” format (and not the default
“name_and_id”).
datelife takes as input either a tree with tip labels as
scientific names (and not names and ids), or a vector of scientific
names.
Get a Canis subtree with tip labels that do not contain the OTT id.
canis_node_subtree <- rotl::tol_subtree(node_id = canis_mrca, label = "name")
canis_node_subtree
Now, you can use that tree as input for the
get_datelife_result() function.
canis_dr <- datelife::get_datelife_result(canis_node_subtree)
We have now a list of matrices storing time of lineage divergence data for all taxon pairs. Lists are named with the study citation, so we have that information handy at all times.
Let’s explore the output.
class(canis_dr)
Notice canis_dr is class list. Lists are very
useful data formats in R! RStudio makes it easy to click
around the list object and see what it contains. Try this:
View(canis_dr)
To access elements of a list, you can either use indexing (the
numerical position in the list) or, if the elements are named, the name.
Let’s look at the first element of the canis_dr list.
canis_dr[1]
A shortcut to look at the last element of the list is to take
advantage of the function length().
canis_dr[length(canis_dr)]
Then, it is really easy to go from a matrix to a tree, using the
function summarize_datelife_result() with the option
summary_format = "phylo_all". Note the printed output
returns a summary of taxa that have branch length information in the
database.
canis_phylo_all <- datelife::summarize_datelife_result(canis_dr, summary_format = "phylo_all")
To plot the resulting tree, you can use the plot.phylo()
function from ape. You can also use
datelifeplot functions, such as
plot_phylo_all() in the Console for an interactive plot
that adds the study citation as title, as well as a
geo-chrono-stratigraphic axis for a time reference.
Run the following in the console:
datelifeplot::plot_phylo_all(chronograms = canis_phylo_all, plot_height = 4, plot_width = 6)
Plot each tree sparately using list indexing.
ape::plot.phylo(canis_phylo_all[<insert a number here>])
Now that we have a collection of chronograms containing our taxa of interest, we can go on to summarize the information in them into a single chronogram.
There is no consensus on the best way to do this, but the fastest one is using the median of node ages for each node with available information, and then evenly distributing ages across nodes.
canis_phylo_median <- datelife::summarize_datelife_result(canis_dr, summary_format = "phylo_median")
Now plot your chronogram!
ape::plot.phylo(canis_phylo_median, cex = 1.2)
# Add the time axis:
ape::axisPhylo()
# And a little hack to add the axis name:
graphics::mtext("Time (myrs)", side = 1, line = 2, at = max(get("last_plot.phylo",envir = .PlotPhyloEnv)$xx) * 0.5)
Questions:
What is the approximate age of the MRCA of Lycalopex?
What is the approximate age of the MRCA of all the members of your Canis tree?
Produce a tree, with names on the tips, of a group of interest. This can be related to your resarch or any other organism you want.
Great Work!
This lab activity was originally written by Luna L. Sanchez Reyes and Emily Jane McTavish (see https://github.com/McTavishLab/R_OpenTree_tutorials/tree/gh-pages) and modified by Jenna Ekwealor.
Open Tree of Life aims to construct a comprehensive, dynamic and digitally-available tree of life by synthesizing published phylogenetic trees along with taxonomic data. The project is a collaborative effort between 11 PIs across 10 institutions. Funded by NSF ABI #1759838, NSF AVAToL #1208809. More information and tutorials at opentreeoflife.org.